Statistical learning: logistic regression

MACS 30100
University of Chicago

February 13, 2017

Titanic

Sinking of the Titanic

Titanic (1997)

Titanic data

## Classes 'tbl_df', 'tbl' and 'data.frame':    714 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 7 8 9 10 11 ...
##  $ Survived   : int  0 1 1 1 0 0 0 1 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 1 3 3 2 3 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 54 2 27 14 4 ...
##  $ SibSp      : int  1 1 0 1 0 0 3 0 1 1 ...
##  $ Parch      : int  0 0 0 0 0 0 1 2 0 1 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:177] 6 18 20 27 29 30 32 33 37 43 ...
##   .. ..- attr(*, "names")= chr [1:177] "6" "18" "20" "27" ...

A linear regression approach

Predicting port of embarkation

Numeric value   Port
1               Cherbourg
2               Queenstown
3               Southampton

Predicting port of embarkation

Numeric value   Port
1               Queenstown
2               Cherbourg
3               Southampton

Predicting port of embarkation

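Numeric value   Port
1               Southampton
2               Cherbourg
3               Queenstown

  • Each coding implies a different ordering and spacing of the ports, so a linear regression on the numeric values would give different results for the same data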

Logistic regression

  • Model the probability of \(Y\) rather than model \(Y\) directly

    \(p(X) = p(\text{survival} = \text{yes} | \text{age})\)

Linear function

Logistic function

Probability of surviving the Titanic

\[p(\text{Survival}) = \frac{e^{\beta_0 + \beta_{1}\text{Age}}}{1 + e^{\beta_0 + \beta_{1}\text{Age}}}\]

Generating predicted probabilities

\[p(\text{Survival}) = \frac{e^{\beta_0 + \beta_{1} \times 30}}{1 + e^{\beta_0 + \beta_{1} \times 30}}\]

\[p(\text{Survival}) = \frac{e^{-0.057 - 0.011 \times 30}}{1 + e^{-0.057 - 0.011 \times 30}}\]

\[p(\text{Survival}) = 0.405\]
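The calculation above can be reproduced in a few lines. This is a minimal sketch in Python, not the code used to build the slides; the function and variable names are illustrative, and the coefficients are the estimates reported for the age-only model (-0.0567 and -0.0110).

```python
import math

def logistic(z):
    """Map a log-odds value z to a probability between 0 and 1."""
    return math.exp(z) / (1 + math.exp(z))

# Estimates reported for the age-only model in these slides
beta_0 = -0.0567  # intercept
beta_1 = -0.0110  # coefficient on Age

# Predicted probability of survival for a 30-year-old passenger
p_age_30 = logistic(beta_0 + beta_1 * 30)
print(round(p_age_30, 3))
```

The result should agree with the 0.405 shown above, up to rounding of the coefficients.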

Odds

Odds of surviving the Titanic

Log-odds

Log-odds of surviving the Titanic

##          term estimate std.error statistic p.value
## 1 (Intercept)  -0.0567   0.17358    -0.327  0.7438
## 2         Age  -0.0110   0.00533    -2.057  0.0397
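A short derivation, added here as a sketch, connects the probability, odds, and log-odds forms of the age-only model. It uses only the logistic function defined earlier and the estimates in the table above; 0.989 is \(e^{-0.0110}\) rounded to three decimals.

\[\frac{p(\text{Survival})}{1 - p(\text{Survival})} = e^{\beta_0 + \beta_{1}\text{Age}}\]

\[\log\left(\frac{p(\text{Survival})}{1 - p(\text{Survival})}\right) = \beta_0 + \beta_{1}\text{Age}\]

\[\frac{\text{odds}(\text{Age} + 1)}{\text{odds}(\text{Age})} = e^{\beta_{1}} = e^{-0.0110} \approx 0.989\]

Each additional year of age lowers the log-odds of survival by 0.0110 and multiplies the odds by about 0.989, whatever the starting age.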

First differences - age 20 to 30

\[p(\text{Survival}_{30 - 20}) = \frac{e^{\beta_0 + \beta_{1}30}}{1 + e^{\beta_0 + \beta_{1}30}} - \frac{e^{\beta_0 + \beta_{1}20}}{1 + e^{\beta_0 + \beta_{1}20}}\]

\[p(\text{Survival}_{30 - 20}) = \frac{e^{-0.057 - 0.011 \times 30}}{1 + e^{-0.057 - 0.011 \times 30}} - \frac{e^{-0.057 - 0.011 \times 20}}{1 + e^{-0.057 - 0.011 \times 20}}\]

\[p(\text{Survival}_{30 - 20}) = 0.405 - 0.431\]

\[p(\text{Survival}_{30 - 20}) = -0.0267\]

First differences - age 40 to 50

\[p(\text{Survival}_{50 - 40}) = \frac{e^{\beta_0 + \beta_{1}50}}{1 + e^{\beta_0 + \beta_{1}50}} - \frac{e^{\beta_0 + \beta_{1}40}}{1 + e^{\beta_0 + \beta_{1}40}}\]

\[p(\text{Survival}_{50 - 40}) = \frac{e^{-0.057 - 0.011 \times 50}}{1 + e^{-0.057 - 0.011 \times 50}} - \frac{e^{-0.057 - 0.011 \times 40}}{1 + e^{-0.057 - 0.011 \times 40}}\]

\[p(\text{Survival}_{50 - 40}) = 0.353 - 0.379\]

\[p(\text{Survival}_{50 - 40}) = -0.0254\]
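The two first differences can be compared side by side. The sketch below is illustrative Python with hypothetical function names, using the reported age-only estimates; it shows that the same ten-year change in age produces a different change in predicted probability depending on the starting age, because the logistic function is not linear.

```python
import math

def predicted_probability(age, beta_0=-0.0567, beta_1=-0.0110):
    """Predicted probability of survival from the age-only model."""
    z = beta_0 + beta_1 * age
    return 1 / (1 + math.exp(-z))

def first_difference(age_from, age_to):
    """Change in predicted probability when age moves from age_from to age_to."""
    return predicted_probability(age_to) - predicted_probability(age_from)

fd_20_30 = first_difference(20, 30)
fd_40_50 = first_difference(40, 50)
print(f"{fd_20_30:.4f}", f"{fd_40_50:.4f}")
```

Both values are negative, and the change from 20 to 30 is slightly larger in magnitude than the change from 40 to 50, in line with the figures above.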

Estimating the parameters

Multiple predictors

\[p(X) = \frac{e^{\beta_0 + \beta_{1}X_1 + \dots + \beta_{p}X_{p}}}{1 + e^{\beta_0 + \beta_{1}X_1 + \dots + \beta_{p}X_{p}}}\]

Women and children first

\[p(\text{Survival}) = \frac{e^{\beta_0 + \beta_{1}\text{Age} + \beta_{2}\text{Sex}}}{1 + e^{\beta_0 + \beta_{1}\text{Age} + \beta_{2}\text{Sex}}}\]

##          term estimate std.error statistic  p.value
## 1 (Intercept)  1.27727   0.23017      5.55 2.87e-08
## 2         Age -0.00543   0.00631     -0.86 3.90e-01
## 3     Sexmale -2.46592   0.18538    -13.30 2.26e-40
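To see what these estimates imply, the sketch below plugs them into the logistic function for a woman and a man of the same age. It is illustrative Python, not the original analysis code; the age of 30 is an arbitrary choice for the example, and the `male` argument is assumed to be coded 1 for men and 0 for women, as the `Sexmale` term suggests.

```python
import math

# Estimates reported for the Age + Sex model in these slides
BETA_0 = 1.27727      # intercept
BETA_AGE = -0.00543   # coefficient on Age
BETA_MALE = -2.46592  # coefficient on Sexmale (1 = male, 0 = female)

def predicted_probability(age, male):
    """Predicted probability of survival for a given age and sex indicator."""
    z = BETA_0 + BETA_AGE * age + BETA_MALE * male
    return 1 / (1 + math.exp(-z))

# Illustrative comparison at age 30 (an arbitrary age chosen for this example)
p_female_30 = predicted_probability(30, male=0)
p_male_30 = predicted_probability(30, male=1)
print(round(p_female_30, 2), round(p_male_30, 2))
```

Under these estimates the predicted probability is roughly 0.75 for the woman and roughly 0.21 for the man, so sex accounts for far more of the difference than age does.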

Predicted probabilities and first differences in multiple variable models

\[p(X) = \frac{e^{\beta_0 + \beta_{1}X_1 + \beta_{2}X_2}}{1 + e^{\beta_0 + \beta_{1}X_1 + \beta_{2}X_2}}\]

  • Change in log-odds
  • Change in predicted probabilities

Age plus fare model

Calculating FDs for multiple variable models

  • Hold the other predictors constant at a typical value
    • If continuous, the median value
    • If discrete, the modal value
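The procedure can be sketched as follows. This is a hedged illustration in Python rather than the slides' own code: it reuses the Age + Sex estimates reported earlier, the ages are the first ten values shown in the data summary, and the sex indicators are hypothetical values invented only to demonstrate how the mode is chosen.

```python
import math
from statistics import median, mode

# Estimates reported for the Age + Sex model in these slides
COEFS = {"intercept": 1.27727, "age": -0.00543, "male": -2.46592}

def predicted_probability(age, male):
    """Predicted probability of survival (male: 1 = male, 0 = female)."""
    z = COEFS["intercept"] + COEFS["age"] * age + COEFS["male"] * male
    return 1 / (1 + math.exp(-z))

# Toy sample used only to show how typical values are chosen.
# The sex indicators are hypothetical, NOT taken from the Titanic data.
toy_ages = [22, 38, 26, 35, 35, 54, 2, 27, 14, 4]
toy_male = [1, 0, 0, 0, 1, 1, 1, 0, 1, 1]

typical_age = median(toy_ages)  # continuous predictor: median
typical_male = mode(toy_male)   # discrete predictor: mode

# First difference for sex (female to male), holding age at its median
fd_sex = (predicted_probability(typical_age, 1)
          - predicted_probability(typical_age, 0))

# First difference for age (20 to 30), holding sex at its mode
fd_age = (predicted_probability(30, typical_male)
          - predicted_probability(20, typical_male))

print(typical_age, typical_male, round(fd_sex, 3), round(fd_age, 3))
```

Fixing the other predictor matters because, in a logistic model, the size of a first difference depends on where the remaining predictors are held.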

Interactive terms

\[p(\text{Survival}) = \frac{e^{\beta_0 + \beta_{1}\text{Age} + \beta_{2}\text{Sex}}}{1 + e^{\beta_0 + \beta_{1}\text{Age} + \beta_{2}\text{Sex}}}\]

\[p(\text{Survival}) = \frac{e^{\beta_0 + \beta_{1}\text{Age} + \beta_{2}\text{Sex} + \beta_{3} \times \text{Age} \times \text{Sex}}}{1 + e^{\beta_0 + \beta_{1}\text{Age} + \beta_{2}\text{Sex} + \beta_{3} \times \text{Age} \times \text{Sex}}}\]

Interactive terms

##          term estimate std.error statistic p.value
## 1 (Intercept)   0.5938    0.3103      1.91 0.05569
## 2         Age   0.0197    0.0106      1.86 0.06240
## 3     Sexmale  -1.3178    0.4084     -3.23 0.00125
## 4 Age:Sexmale  -0.0411    0.0136     -3.03 0.00241

Relationship for women

\[p(\text{Survival}_{female}) = \frac{e^{\beta_0 + \beta_{1}\text{Age} + \beta_{2} \times 0 + \beta_{3} \times \text{Age} \times 0}}{1 + e^{\beta_0 + \beta_{1}\text{Age} + \beta_{2} \times 0 + \beta_{3} \times \text{Age} \times 0}}\]

\[p(\text{Survival}_{female}) = \frac{e^{\beta_0 + \beta_{1}\text{Age}}}{1 + e^{\beta_0 + \beta_{1}\text{Age}}}\]

Relationship for men

\[p(\text{Survival}_{male}) = \frac{e^{\beta_0 + \beta_{1}\text{Age} + \beta_{2} \times 1 + \beta_{3} \times \text{Age} \times 1}}{1 + e^{\beta_0 + \beta_{1}\text{Age} + \beta_{2} \times 1 + \beta_{3} \times \text{Age} \times 1}}\]

\[p(\text{Survival}_{male}) = \frac{e^{\beta_0 + \beta_{1}\text{Age} + \beta_{2} + \beta_{3} \times \text{Age}}}{1 + e^{\beta_0 + \beta_{1}\text{Age} + \beta_{2} + \beta_{3} \times \text{Age}}}\]

\[p(\text{Survival}_{male}) = \frac{e^{(\beta_0 + \beta_{2}) + (\beta_{1} + \beta_{3})\text{Age}}}{1 + e^{(\beta_0 + \beta_{2}) + (\beta_{1} + \beta_{3})\text{Age}}}\]

Interactive relationship

\[p(\text{Survival}_{female}) = \frac{e^{\beta_0 + \beta_{1}\text{Age}}}{1 + e^{\beta_0 + \beta_{1}\text{Age}}}\]

\[p(\text{Survival}_{male}) = \frac{e^{(\beta_0 + \beta_{2}) + (\beta_{1} + \beta_{3})\text{Age}}}{1 + e^{(\beta_0 + \beta_{2}) + (\beta_{1} + \beta_{3})\text{Age}}}\]
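Plugging the reported interaction-model estimates into the full model, with Sex coded 1 for men and 0 for women, shows how the effect of age differs by sex. The sketch below is illustrative Python with hypothetical names; the ages 10 and 50 are arbitrary values chosen for the comparison.

```python
import math

# Estimates reported for the Age x Sex interaction model in these slides
BETA_0 = 0.5938       # intercept
BETA_AGE = 0.0197     # Age
BETA_MALE = -1.3178   # Sexmale (1 = male, 0 = female)
BETA_INTER = -0.0411  # Age:Sexmale

def predicted_probability(age, male):
    """Predicted probability of survival under the interaction model."""
    z = BETA_0 + BETA_AGE * age + BETA_MALE * male + BETA_INTER * age * male
    return 1 / (1 + math.exp(-z))

# Effect of one year of age on the log-odds, by sex
slope_women = BETA_AGE
slope_men = BETA_AGE + BETA_INTER

# Illustrative ages (arbitrary choices for this example)
for age in (10, 50):
    print(age,
          round(predicted_probability(age, male=0), 3),
          round(predicted_probability(age, male=1), 3))
```

With these estimates the log-odds of survival rise with age for women (slope 0.0197) and fall with age for men (slope 0.0197 - 0.0411 = -0.0214), which a model without the interaction term cannot represent.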

Evaluating model accuracy

  • Accuracy/error rate
  • Proportional reduction in error
  • Receiver operating characteristic (ROC) curve and area under the curve (AUC)

Accuracy of predictions

  • Convert predicted probabilities to predicted outcomes
  • Threshold value: predict survival when the predicted probability exceeds it
  • Accuracy rate: percentage of predictions that are correct
  • Age only model accuracy rate: \(59.4\%\)
    • Error rate: \(40.6\%\)

Accuracy of predictions

  • Baseline value
    • \(0\%\)
    • \(50\%\)
    • Useless classifier: always predict the modal category
  • Baseline for Titanic data - \(59.4\%\)
  • Age-only model - \(59.4\%\)
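One way to see why the age-only model only matches the baseline is to check its largest predicted probability. The sketch below is illustrative Python; it assumes a classification threshold of 0.5, which the slides do not state explicitly, and it takes the passenger and error counts (714 and 290) from figures reported elsewhere in this deck.

```python
import math

def predicted_probability(age, beta_0=-0.0567, beta_1=-0.0110):
    """Predicted probability of survival from the age-only model."""
    return 1 / (1 + math.exp(-(beta_0 + beta_1 * age)))

THRESHOLD = 0.5  # assumed cut-off; the slides do not state the value used

# The coefficient on age is negative, so the predicted probability is
# highest at age 0 and falls from there.
max_probability = predicted_probability(0)
predicts_any_survivor = max_probability >= THRESHOLD

# Counts reported in these slides: 714 passengers, 290 baseline errors
n_total = 714
n_died = n_total - 290
baseline_accuracy = n_died / n_total

print(round(max_probability, 3), predicts_any_survivor,
      round(baseline_accuracy, 3))
```

Under that assumption no passenger is ever predicted to survive, so the age-only model makes the same predictions as the modal-category classifier and has the same accuracy rate.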

Accuracy of predictions

  • Age-only model: 59.4%
  • Age x gender interactive model: 78%

Proportional reduction in error

\[PRE = \frac{E_1 - E_2}{E_1}\]

Proportional reduction in error

\[PRE_{\text{Age}} = \frac{290 - 290}{290}\]

\[PRE_{\text{Age}} = \frac{0}{290}\]

\[PRE_{\text{Age}} = 0\%\]

Proportional reduction in error

\[PRE_{\text{Age x Gender}} = \frac{290 - 157}{290}\]

\[PRE_{\text{Age x Gender}} = \frac{133}{290}\]

\[PRE_{\text{Age x Gender}} = 45.9\%\]
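The same arithmetic as a small helper, sketched in Python with a hypothetical function name. The error counts are the ones used above: 290 errors for the baseline and the age-only model, and 157 for the interactive model.

```python
def proportional_reduction_in_error(baseline_errors, model_errors):
    """PRE = (E1 - E2) / E1, where E1 is the baseline error count."""
    return (baseline_errors - model_errors) / baseline_errors

pre_age = proportional_reduction_in_error(290, 290)
pre_interactive = proportional_reduction_in_error(290, 157)

print(f"{pre_age:.1%}", f"{pre_interactive:.1%}")
```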

Types of error

Confusion matrix for interactive model

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 360  93
##          1  64 197
##                                        
##                Accuracy : 0.78         
##                  95% CI : (0.748, 0.81)
##     No Information Rate : 0.594        
##     P-Value [Acc > NIR] : <2e-16       
##                                        
##                   Kappa : 0.537        
##  Mcnemar's Test P-Value : 0.0254       
##                                        
##             Sensitivity : 0.849        
##             Specificity : 0.679        
##          Pos Pred Value : 0.795        
##          Neg Pred Value : 0.755        
##              Prevalence : 0.594        
##          Detection Rate : 0.504        
##    Detection Prevalence : 0.634        
##       Balanced Accuracy : 0.764        
##                                        
##        'Positive' Class : 0            
## 

Alternative thresholds

  • Sensitivity/recall

    \(TPR = \frac{\text{Number of actual positives correctly predicted}}{\text{Number of actual positives}}\)

  • Specificity

    \(TNR = \frac{\text{Number of actual negatives correctly predicted}}{\text{Number of actual negatives}}\)

  • Balancing the two
  • Adjusting threshold

Threshold = \(0.8\)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 413 253
##          1  11  37
##                                         
##                Accuracy : 0.63          
##                  95% CI : (0.594, 0.666)
##     No Information Rate : 0.594         
##     P-Value [Acc > NIR] : 0.0256        
##                                         
##                   Kappa : 0.117         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.974         
##             Specificity : 0.128         
##          Pos Pred Value : 0.620         
##          Neg Pred Value : 0.771         
##              Prevalence : 0.594         
##          Detection Rate : 0.578         
##    Detection Prevalence : 0.933         
##       Balanced Accuracy : 0.551         
##                                         
##        'Positive' Class : 0             
## 
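The trade-off between the two confusion matrices can be checked directly from their cell counts. This is an illustrative Python sketch with hypothetical names; it follows the output above in treating 0 (did not survive) as the positive class, so a true positive is a passenger correctly predicted not to survive.

```python
def classification_rates(tp, fn, fp, tn):
    """Accuracy, sensitivity (TPR), and specificity (TNR) from cell counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Cell counts read from the two matrices, with class 0 as 'positive':
# tp = predicted 0, actual 0    fp = predicted 0, actual 1
# fn = predicted 1, actual 0    tn = predicted 1, actual 1
rates_first = classification_rates(tp=360, fn=64, fp=93, tn=197)
rates_080 = classification_rates(tp=413, fn=11, fp=253, tn=37)

for name, rates in (("first matrix", rates_first), ("threshold 0.8", rates_080)):
    print(name, {key: round(value, 3) for key, value in rates.items()})
```

Raising the threshold to 0.8 pushes sensitivity up and specificity down sharply, and overall accuracy falls, which is why a single threshold gives an incomplete picture of a classifier.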

Many different thresholds

ROC curve

  • Receiver operating characteristic (ROC) curve
  • Plot false positive rate vs. true positive rate
    • \(1 - \text{specificity}\) vs. sensitivity
  • Area under the curve (AUC)
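A ROC curve can be traced by sweeping the threshold over the predicted probabilities and recording the false positive and true positive rates at each step. The sketch below is a minimal Python illustration on hypothetical labels and scores, not the Titanic data, and it computes the AUC with the trapezoidal rule; the function names are illustrative.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs for each distinct threshold, strictest first."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        predicted = [score >= threshold for score in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical outcomes (1 = positive class) and predicted probabilities
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

points = roc_points(toy_labels, toy_scores)
toy_auc = auc(points)
print(points)
print(toy_auc)
```

An AUC of 0.5 corresponds to a classifier no better than chance and 1.0 to perfect separation; the toy data here fall in between.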
